How we reduced logs costs by moving from Elasticsearch to Grafana Loki


Igor Latkin 


WHO AM I?


IGOR LATKIN 


Co-founder & System Architect @ KTS 


- Corporate systems
- Non-standard projects
- Mobile
- DevOps


AGENDA


1. Log collection task
2. Loki architecture
3. Our journey transferring logs from ES to Loki
4. (bonus) Loki configuration tips & tricks


WHAT THIS TALK IS NOT 


- A Loki or Elasticsearch tutorial
- A comprehensive Loki vs Elasticsearch comparison
- A complete set of instructions for transferring logs in your environment


HighLoad++ Armenia


LOG COLLECTION PROBLEM


How to collect


How to store


How to extract metadata


MULTIPLE SOURCES


- KUBERNETES CLUSTER
- VIRTUAL MACHINES
- DATABASES
- TRAFFIC SNIFFING
- CLOUD ENVS
- SYSLOG
- WINDOWS EVENTS
- DOCKER


LOGGING PIPELINE


[Diagram. Sources: Kubernetes cluster, databases, virtual machines, Docker. Stages: Collect, Process, Filter, Storage, User.]


WE ALL KNOW THEM


- rsyslog
- elastic
- graylog
- ClickHouse
- Grafana Loki


LOKI ARCHITECTURE


[Diagram. Components: query-frontend, distributor, caches, ingester, index-gateway, compactor.]


INDEXING


2019-12-11T10:01:02.123456789Z {app="nginx",cluster="us-west1"} GET /about


Timestamp                    Prometheus-style Labels    Content
with nanosecond precision    key-value pairs            logline

indexed                                                 unindexed


STREAMS


Logs

Labels                                                     Message              Stream
{component="printer", location="f2c16", level="error"}     "Out of paper"       stream 1
{component="printer", location="f2c16", level="error"}     "Too much paper"     stream 1
{component="supplier", location="f2c16", level="info"}     "Paper exhausted"    stream 2

The log lines of a stream are stored in chunks (chunk #1, chunk #2).
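To make the stream idea concrete, here is a minimal sketch in Python. It assumes that a stream is identified purely by its full label set, and it reuses the label values from the slide. The helper names `stream_key` and `group_into_streams` are hypothetical and are not part of Loki.

```python
from collections import defaultdict


def stream_key(labels):
    """Build a hashable identity for a label set; key order does not matter."""
    return tuple(sorted(labels.items()))


def group_into_streams(entries):
    """Group (labels, message) pairs by label set, keeping arrival order."""
    streams = defaultdict(list)
    for labels, message in entries:
        streams[stream_key(labels)].append(message)
    return dict(streams)


entries = [
    ({"component": "printer", "location": "f2c16", "level": "error"}, "Out of paper"),
    ({"component": "printer", "location": "f2c16", "level": "error"}, "Too much paper"),
    ({"component": "supplier", "location": "f2c16", "level": "info"}, "Paper exhausted"),
]

streams = group_into_streams(entries)
for key, messages in streams.items():
    print(dict(key), messages)
```

Two entries with identical labels land in the same stream, while changing any single label value creates a new stream.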


CHUNKS & INDEX


[Diagram: chunks and index stored in S3; index types BOLTDB-SHIPPER and TSDB; GRPC call.]


ELASTICSEARCH -> LOKI


WHY WE DECIDED 
TO MOVE? 


MOTIVATION TO DITCH ELASTIC


1. Huge cost burden
2. Hard to maintain


ELASTICSEARCH-BASED INFRA


[Diagram: app nodes running filebeat and packetbeat; Elasticsearch cluster made of master nodes, coordinator nodes and data nodes.]


MIGRATION PLAN


1. Deploy Loki cluster into Kubernetes
2. Deploy S3 storage
3. Start collecting live logs in Loki
4. Solve troubles, constantly reconfigure Loki to handle the load
5. Transfer old logs
6. Fail multiple times
7. Succeed, wait for the transfer to complete
8. Demolish Elasticsearch cluster


MIGRATION PLAN 


1. Deploy Loki cluster into Kubernetes 
2. Start collecting live logs in Loki 


3. Transfer old logs 




COLLECTING 
LIVE LOGS IN LOKI 


NOT BREAKING EVERYTHING


[Diagram: app nodes, packetbeat-parallel, Elasticsearch.]


COLLECTING 


-> Application logs 

-> Packetbeat logs 

-> Postgres logs (new) 
-> 1C logs


LOGS FROM FILES


scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: omni_services
          __path__: /var/log/omni/services/*/*log

      - targets:
          - localhost
        labels:
          job: packetbeat
          __path__: {{ promtail_packetbeat_folder }}/output

      - targets:
          - localhost
        labels:
          job: postgres_logging
          __path__: {{ promtail_postgres_folder }}/*/log/*log


EXTRACTED LABELS


                                                            node_name (4)
camunda                  prod   1C_logging         DEBUG    1c-prod-2
catalog                         omni_services      ERROR    prod-db01
cession-service                 postgres_logging   INFO     prod-linux-1
customer-checker                                   TRACE    prod-linux-2
customer-data                                      WARN
digital-subscriptions
discount-service


domain (64)   http_method (7)   http_status (19)   network_direction (2)
              CONNECT           200                egress                  Error
              DELETE            204                ingress                 OK
              GET               301
              HEAD              302
              OPTIONS           303
              POST              304
              PUT               400


TRANSFER OLD LOGS 


WHAT CONCERNED US?


— 1 year of log data 
— 25 TB of storage 


INITIAL ASSUMPTION


Elasticsearch cluster logs, split into time intervals:

interval 1: 2021-01-01 10:59:02, 2021-01-01 11:20:12
interval 2: 2021-01-02 11:20:12, 2021-02-01 04:17:10, 2021-02-01 10:17:10
...
interval N: 2021-12-20 07:10:10, 2021-12-31 23:59:59
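A rough sketch of how a year of logs could be cut into N intervals for parallel workers. The function name `split_into_intervals` and the choice of 12 equal-width intervals are assumptions made for illustration; the talk does not say how the intervals were actually chosen.

```python
from datetime import datetime


def split_into_intervals(start, end, n):
    """Split [start, end) into n contiguous, non-overlapping intervals."""
    if n <= 0:
        raise ValueError("n must be positive")
    step = (end - start) / n
    bounds = [start + step * i for i in range(n)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


intervals = split_into_intervals(datetime(2021, 1, 1), datetime(2022, 1, 1), 12)
for number, (lo, hi) in enumerate(intervals, start=1):
    print(f"interval {number}: {lo} .. {hi}")
```

Each interval can then be handed to its own Preprocess -> Buffer -> Sender pipeline.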


INITIAL ASSUMPTION


[Diagram: Elasticsearch cluster logs are split into intervals 1..N; inside the transfer utility every interval runs through its own Preprocess -> Buffer -> Sender pipeline, all in parallel.]

interval 1: 2021-01-01 10:59:02, 2021-01-01 11:20:12
interval 2: 2021-01-02 11:20:12, 2021-02-01 04:17:10, 2021-02-01 10:17:10
...
interval N: 2021-12-20 07:10:10, 2021-12-31 23:59:59


LOKI CONFIG


limits_config:
  reject_old_samples: false

  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 30

  per_stream_rate_limit: "150MB"

  max_streams_per_user: 0
  max_global_streams_per_user: 0

  retention_period: 8928h


TESTING OUR ASSUMPTION


entry too far behind, oldest acceptable timestamp is: 2021-01-01T01:00:00


WHY IS THAT?


    // The validity window for unordered writes is the highest timestamp present minus 1/2 * max-chunk-age.
    cutoff := highestTs.Add(-s.cfg.MaxChunkAge / 2)
    if !isReplay && s.unorderedWrites && !highestTs.IsZero() && cutoff.After(entries[i].Timestamp) {
        // This is the error
        failedEntriesWithError = append(failedEntriesWithError, entryWithError{&entries[i], chunkenc.ErrTooFarBehind(cutoff)})
        outOfOrderSamples++
        outOfOrderBytes += lineBytes
        continue
    }


https://github.com/grafana/loki/blob/c75b822fc6998ca57bf53451ec6dc2038c7c1a5e/pkg/ingester/stream.go#L370


MIND EXPERIMENT


[Diagram: interval 1 on a timeline; accepted timestamps lie in a window ending at highestTimestamp.]


Minimal accepted timestamp is:


highestTimestamp - maxChunkAge / 2
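The rule above can be sketched in a few lines of Python. This only mirrors the formula from the slide and is not Loki code: the function names are hypothetical, and the 2h value for maxChunkAge is an assumption chosen so that the cutoff matches the error message shown earlier.

```python
from datetime import datetime, timedelta


def oldest_accepted(highest_ts, max_chunk_age):
    """Cutoff from the slide: highestTimestamp - maxChunkAge / 2."""
    return highest_ts - max_chunk_age / 2


def is_accepted(entry_ts, highest_ts, max_chunk_age):
    """An entry is rejected only when the cutoff is strictly after its timestamp."""
    return not oldest_accepted(highest_ts, max_chunk_age) > entry_ts


highest = datetime(2021, 1, 1, 2, 0, 0)
assumed_max_chunk_age = timedelta(hours=2)  # assumption, see the lead-in

print(oldest_accepted(highest, assumed_max_chunk_age))
print(is_accepted(datetime(2021, 1, 1, 0, 30), highest, assumed_max_chunk_age))
print(is_accepted(datetime(2021, 1, 1, 0, 30), highest, timedelta(hours=8760)))
```

Raising maxChunkAge widens the window, which is why the next slide tries exactly that.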


TWEAK CONFIGURATION


ingester:
  max_chunk_age: 8760h  # 365d


SEEMS TO BE FIXED


[Diagram: accepted timestamps on the timeline, up to highestTimestamp.]


DID WE FOOL LOKI? 


NOT REALLY 


LOKI WON 


Query error


too many unhealthy instances in the ring


WHAT IS HAPPENING?


#5963 Context canceled error (Closed)
#4015 Query from Grafana to Loki returning 502 when using 7 days... (Open)
#3524 Grafana "bad gateway" | Logcli "EOF", while frontend says "s... (Closed)
#2540 Querier high memory demands (Closed)
#3753 [loki distributed] slow query (high cpu usage) and time out (Closed)


It is time to stop guessing and start thinking 


BLOCK & CHUNK


Chunk layout:

| MagicNumber(4b)   | version(1b)         |
| block-1 bytes     | checksum (4b)       |
| block-2 bytes     | checksum (4b)       |
| #blocks (uvarint) |
| #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |

Block layout:

| ts (varint) | len (uvarint) | log-1 bytes |
| ts (varint) | len (uvarint) | log-2 bytes |
| ts (varint) | len (uvarint) | log-3 bytes |
| ts (varint) | len (uvarint) | log-n bytes |

INVESTIGATION STAGE


    for i := 0; i < len(entries); i++ {
        chunk := &s.chunks[len(s.chunks)-1]
        if chunk.closed || !chunk.chunk.SpaceFor(&entries[i]) || s.cutChunkForSynchronization( /* truncated on the slide */
            chunk = s.cutChunk(ctx)
        }

        chunk.lastUpdated = time.Now()
        if err := chunk.chunk.Append(&entries[i]); err != nil { // <- highlighted on the slide
            invalid = append(invalid, entryWithError{&entries[i], err})
            if chunkenc.IsOutOfOrderErr(err) {
                outOfOrderSamples++
                outOfOrderBytes += len(entries[i].Line)
            }
            continue
        }


https://github.com/grafana/loki/blob/c75b822fc6998ca57bf53451ec6dc2038c7c1a5e/pkg/ingester/stream.go#L309


INVESTIGATION STAGE


    // Append implements Chunk.
    func (c *MemChunk) Append(entry *logproto.Entry) error {
        entryTimestamp := entry.Timestamp.UnixNano()

        // If the head block is empty but there are cut blocks, we have to make
        // sure the new entry is not out of order compared to the previous block
        if c.headFmt < UnorderedHeadBlockFmt && c.head.IsEmpty() && len(c.blocks) > 0 && c.blocks[len(c.blocks)-1].maxt > entryTimestamp {
            return ErrOutOfOrder
        }

        if err := c.head.Append(entryTimestamp, entry.Line); err != nil { // <- highlighted on the slide
            return err
        }

        if c.head.UncompressedSize() >= c.blockSize {
            return c.cut()
        }

        return nil
    }


https://github.com/grafana/loki/blob/c75b822fc6998ca57bf53451ec6dc2038c7c1a5e/pkg/chunkenc/memchunk.go#L677


INVESTIGATION STAGE


It just updates the boundaries:

    func (hb *unorderedHeadBlock) Append(ts int64, line string) error {
        // ... (part of the function is omitted on the slide)

        // Update hb metadata
        if hb.size == 0 || hb.mint > ts {
            hb.mint = ts
        }

        if hb.maxt < ts {
            hb.maxt = ts
        }

        hb.size += len(line)
        hb.lines++

        return nil
    }


https://github.com/grafana/loki/blob/c75b822fc6998ca57bf53451ec6dc2038c7c1a5e/pkg/chunkenc/unordered.go#L104
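To see what "just updates the boundaries" means in practice, here is a small Python re-creation of the bookkeeping shown in the Go excerpt. It is a sketch only: the class name is hypothetical, entry storage is left out, and `len(line)` counts characters here rather than bytes.

```python
class HeadBlockSketch:
    """Tracks only the boundaries and counters of a head block."""

    def __init__(self):
        self.mint = 0
        self.maxt = 0
        self.size = 0
        self.lines = 0

    def append(self, ts, line):
        # Any timestamp is accepted; only the [mint; maxt] window is widened.
        if self.size == 0 or self.mint > ts:
            self.mint = ts
        if self.maxt < ts:
            self.maxt = ts
        self.size += len(line)
        self.lines += 1


block = HeadBlockSketch()
for ts in (104, 350, 100, 220):
    block.append(ts, "log line")

print(block.mint, block.maxt, block.lines)
```

When several senders write far-apart timestamps into the same stream, every block ends up with a very wide [mint; maxt] range.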


SO WHAT? 


BLOCKS MIX UP


Head block on the timeline:

block #1     [100; 350]
block #2     [104; 360]
block #10    [110; 380]


CHUNKS MIX UP AS WELL


             Chunk         Chunk         Chunk
block #1     [100; 350]    [112; 381]    [117; 200]
block #2     [104; 360]    [115; 390]    [260; 500]
block #10    [110; 380]    [116; 720]    [120; 800]


QUERY LOGS IN [120; 200]


- Loki is forced to download almost all chunks from the storage
- Process in memory
- Re-sort & filter


             Chunk         Chunk         Chunk
block #1     [100; 350]    [112; 381]    [117; 200]
block #2     [104; 360]    [115; 390]    [260; 500]
block #10    [110; 380]    [116; 720]    [120; 800]

https://github.com/grafana/loki/issues/2540 


HOW TO FIX EVERYTHING?


Logs

Labels                                                     Message              Stream
{component="printer", location="f2c16", level="error"}     "Out of paper"       stream 1
{component="printer", location="f2c16", level="error"}     "Too much paper"     stream 1
{component="supplier", location="f2c16", level="info"}     "Paper exhausted"    stream 2


1. Don't set max_chunk_age to a high value
2. Don't send logs in parallel for the same set of labels (i.e. a stream)
3. You may create additional streams by adding new labels
4. Send logs within a stream only in strict order
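Points 2 to 4 can be sketched as follows: each parallel worker adds one extra label so that it writes to its own stream, and sorts its entries before sending. The label name `transfer_interval` and both helper names are hypothetical and chosen for illustration only.

```python
def with_interval_label(labels, interval_index):
    """Return a copy of the labels with one extra label per transfer interval."""
    extended = dict(labels)
    extended["transfer_interval"] = str(interval_index)  # hypothetical label name
    return extended


def in_strict_order(entries):
    """Sort (timestamp_ns, line) pairs by timestamp before sending them."""
    return sorted(entries, key=lambda entry: entry[0])


base = {"app": "catalog", "level": "INFO"}
stream_a = with_interval_label(base, 1)
stream_b = with_interval_label(base, 2)
batch = in_strict_order([(30, "c"), (10, "a"), (20, "b")])

print(stream_a, stream_b, batch)
```

Because the label sets differ, the two workers never write to the same stream, so each stream only ever receives entries in increasing timestamp order.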


JUST ADD LABELS


[Diagram: Elasticsearch cluster logs are split into intervals 1..N; inside the transfer utility each interval gets its own transfer (Preprocess -> Buffer -> Sender), and all senders push to Loki.]

interval 1: 2021-01-01 10:59:02, 2021-01-01 11:20:12
interval 2: 2021-01-02 11:20:12, 2021-02-01 04:17:10, 2021-02-01 10:17:10
...
interval N: 2021-12-20 07:10:10, 2021-12-31 23:59:59


QUERY LOGS IN [120; 200]


             Chunk         Chunk         Chunk
block #1     [100; 120]    [141; 160]    [221; 250]
block #2     [121; 130]    [161; 200]    [251; 270]
block #10    [131; 140]    [201; 220]    [291; 300]


CHUNKS SITUATION


1. A query matches fewer chunks
2. All blocks within all chunks are strictly ordered
3. Fast and efficient query execution
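The difference can be checked with a small overlap count over the block ranges shown on the slides. This is a simplified model: it assumes a chunk has to be fetched whenever any of its blocks overlaps the query range, and the helper names are hypothetical.

```python
def overlaps(block, query):
    """True when the closed ranges [mint; maxt] and [from; to] intersect."""
    (mint, maxt), (q_from, q_to) = block, query
    return mint <= q_to and maxt >= q_from


def chunks_hit(chunks, query):
    """Indexes of chunks with at least one block overlapping the query."""
    return [i for i, blocks in enumerate(chunks) if any(overlaps(b, query) for b in blocks)]


def blocks_hit(chunks, query):
    """Total number of blocks overlapping the query."""
    return sum(overlaps(b, query) for blocks in chunks for b in blocks)


mixed = [
    [(100, 350), (104, 360), (110, 380)],
    [(112, 381), (115, 390), (116, 720)],
    [(117, 200), (260, 500), (120, 800)],
]
ordered = [
    [(100, 120), (121, 130), (131, 140)],
    [(141, 160), (161, 200), (201, 220)],
    [(221, 250), (251, 270), (291, 300)],
]
query = (120, 200)

print(chunks_hit(mixed, query), blocks_hit(mixed, query))
print(chunks_hit(ordered, query), blocks_hit(ordered, query))
```

With the mixed-up ranges every chunk has to be fetched, while with ordered ranges the last chunk is skipped entirely.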


TRANSFER IMPLEMENTATION


[Diagram. Transfer utility: Elasticsearch cluster -> ES Scroller -> Logs Processor -> Chunker -> chunks queue -> Loki Sender.]


ES SCROLLER


1. Sort by (@timestamp, log.offset)
2. Use search_after to scroll & resume search
3. Do not let a consumer wait for data
4. Handle errors properly
5. Some shards may return no data


ALWAYS WAIT FOR ERROR RESOLUTION - WE CANNOT LOSE LOG LINES
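A minimal sketch of points 1 and 2, showing only how the search request body could be built. The function names and the page size are assumptions. The `sort` and `search_after` fields follow the Elasticsearch search API, and the cursor is taken from the `sort` array of the last hit on the previous page.

```python
def build_search_body(page_size, search_after=None):
    """Request body sorted by (@timestamp, log.offset), optionally resumed."""
    body = {
        "size": page_size,
        "sort": [{"@timestamp": "asc"}, {"log.offset": "asc"}],
    }
    if search_after is not None:
        # Continue right after the last document that was already processed.
        body["search_after"] = list(search_after)
    return body


def next_cursor(hits):
    """Sort values of the last hit, or None when the page is empty."""
    return hits[-1]["sort"] if hits else None


first_page = build_search_body(1000)
hits = [
    {"_source": {"message": "a"}, "sort": [1609498742000, 10]},
    {"_source": {"message": "b"}, "sort": [1609500012000, 42]},
]
second_page = build_search_body(1000, next_cursor(hits))

print(first_page)
print(second_page)
```

Persisting the last cursor is what lets a transfer resume after an error instead of starting over.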


LOGS PROCESSOR


PERFORMS PROCESSING SIMILAR TO PROMTAIL


1. Extract timestamp
2. Extract labels
3. Filter logs
4. Make a unique stream to ensure parallelism


LOGS PROCESSOR


    from typing import MutableMapping, Optional

    from es2loki import BaseTransfer  # base class provided by es2loki


    class Transfer(BaseTransfer):
        # Map fields of an Elasticsearch document to Loki labels.
        def extract_doc_labels(self, source: dict) -> Optional[MutableMapping[str, str]]:
            return dict(
                app=source.get("fields", {}).get("service_name"),
                job="logs",
                level=source.get("level"),
                node_name=source.get("host", {}).get("name"),
                logger_name=source.get("logger_name"),
            )


LOKI SENDER 


1 Send data to /loki/api/v1/push 
2 Send data sequentially 
3 Use only 1 worker (for now) 


4 Support for different protocols 
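As an illustration of points 1 and 2, the sketch below builds a JSON body for /loki/api/v1/push with a single stream whose values are sorted by timestamp. The function name is hypothetical and the HTTP call itself is left out. The payload shape (a `streams` list, where each entry has a `stream` label map and `values` as pairs of nanosecond-timestamp string and line) follows the Loki push API.

```python
import json


def build_push_payload(labels, entries):
    """Build a push body for one stream from (timestamp_ns, line) pairs."""
    ordered = sorted(entries, key=lambda entry: entry[0])
    return {
        "streams": [
            {
                "stream": dict(labels),
                # Loki expects the timestamp as a string of nanoseconds.
                "values": [[str(ts_ns), line] for ts_ns, line in ordered],
            }
        ]
    }


payload = build_push_payload(
    {"app": "catalog", "job": "logs"},
    [(1609500012000000000, "second"), (1609498742000000000, "first")],
)

print(json.dumps(payload, indent=2))
```

Sending one such batch at a time from a single worker keeps the entries of a stream in order.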


https://github.com/ktsstudio/es2loki


YOU CAN ALREADY USE IT


    class Transfer(BaseTransfer):
        def extract_doc_labels(self, source: dict) -> Optional[MutableMapping[str, str]]:
            return dict(
                app=source.get("fields", {}).get("service_name"),
                job="logs",
                level=source.get("level"),
                node_name=source.get("host", {}).get("name"),
                logger_name=source.get("logger_name"),
            )


    ELASTIC_HOSTS=http://localhost:9200 \
    ELASTIC_INDEX="filebeat-*" \
    LOKI_URL=http://localhost:3100 \
    python ./transfer.py


YOU CAN ALREADY USE IT


JUST ADD YOUR OWN DOCKER IMAGE, WE'LL DO THE REST:


    $ helm repo add kts https://charts.kts.studio
    $ helm repo update
    $ helm upgrade --install \
        RELEASE_NAME \
        kts/es2loki \
        --set image.repository=your-docker-image \
        --set image.tag=latest


SEE HOW IT WORKS 


Components
1. Elasticsearch
2. Kibana
3. filebeat (imports "old" logs to Elasticsearch)
4. Grafana
5. Loki
6. Promtail (imports "new" logs to Loki)
7. PostgreSQL (needed for es2loki)
8. es2loki


Usage
In order to run a demo you may use:

    docker compose up


https://github.com/ktsstudio/es2loki


ANY PROFIT? 


LOKI IN KUBERNETES


Pods(loki)[19]

NAME                                    STATUS     MEM/R:L      NODE
loki-compactor-565b5b8565-wczqw         Running
loki-distributor-64b94c49fc-8kpxl       Running    100:2048     wnode1
loki-distributor-64b94c49fc-ffq2z       Running    100:2048     wnode2
loki-distributor-64b94c49fc-q9ckb       Running    100:2048     wnode3
loki-gateway-55dcf67678-w8qww           Running    50:500       wnode2
loki-index-gateway-0                    Running    50:200       wnode1
loki-ingester-0                         Running    3072:7168    wnode1
loki-ingester-1                         Running    3072:7168    wnode1
loki-ingester-2                         Running    3072:7168    wnode1
loki-memcached-frontend-0               Running    50:0         wnode1
loki-memcached-frontend-1               Running    50:          wnode3
loki-memcached-frontend-2               Running    50:          wnode2
loki-memcached-index-queries-0          Running    100:         wnode1
loki-memcached-index-queries-1          Running    100:         wnode3
loki-memcached-index-queries-2          Running    100:0        wnode2
loki-querier-866486df4d-21hp7           Running    1024:5120    wnode3
loki-querier-866486df4d-lw94f           Running    1024:5120    wnode2
loki-querier-866486df4d-rmzw8           Running    1024:5120    wnode1
loki-query-frontend-7f89d8c6c7-5wn9j    Running    300:5120     wnode3


ELASTICSEARCH VS LOKI


Elasticsearch cluster:
- master nodes
- coordinator nodes
- data nodes (data1 ... data6)


Loki cluster:
- distributor
- query-frontend
- caches
- querier
- index-gateway
- compactor
- storage


NEW STATELESS ELASTICSEARCH ARCHITECTURE


[Diagram: existing architecture (ingest clients, hot primary and replica, users) compared with the new architecture (ingest clients, indexing tier, users).]


https://www.elastic.co/blog/stateless-your-new-state-of-find-with-elasticsearch


INFRA COMPARISON


ELASTICSEARCH

-> CPU: 44
-> Memory: 232 GB
-> Main disks: 338 GB
-> Storage disks: 25 TB


LOKI

-> CPU: 7
-> Memory: 14 GB
-> Main disks: 105 GB
-> S3 Storage: 5 TB


SAVINGS


-> 6x less CPU
-> 16x less RAM
-> 3x less disk
-> 5x less main storage
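The ratios follow directly from the figures on the infra comparison slide. A quick check in Python, where the dictionary keys are names made up for this sketch:

```python
elasticsearch = {"cpu": 44, "memory_gb": 232, "main_disks_gb": 338, "storage_tb": 25}
loki = {"cpu": 7, "memory_gb": 14, "main_disks_gb": 105, "storage_tb": 5}

# Ratio of Elasticsearch resources to Loki resources, per category.
ratios = {name: elasticsearch[name] / loki[name] for name in elasticsearch}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x less")
```

The slide rounds these ratios to whole numbers.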


RECAP


RESULTS


POSITIVE:

1. 4x logs infrastructure cost reduction
2. Simpler infrastructure maintenance
3. Much more stable installation than Elasticsearch
4. Logs transferred within a reasonable timespan
5. Easier to scale


RESULTS


NOT SO POSITIVE, BUT OK:

1. Needed to use faster disks for S3
2. Grafana is not Kibana
3. No full-text search
4. No Machine Learning
5. Some queries in Loki are slower


LOKI CONFIGURATION CHECKLIST (BONUS)


CONFIGURATION TIPS


Essential Grafana Loki configuration settings

t.ly/vMvk

TECH TALKS - DECEMBER, 16

MLOPS IN ENTERPRISES AND SMB

-> 15:30 - 15:50


Leave your feedback!

You can rate the talk and give feedback on what you liked or what could be improved.

@igorlatkin
linkedin.com/in/igorlatkin


Co-organizer

Yandex


